Abstract

Signal transducers and activators of transcription (STAT) proteins are key signalling molecules in metazoans, implicated in 
various cellular processes. Increased research in the field has resulted in the accumulation of STAT sequence and structure 
data, which are scattered across various public databases, missing extensive functional annotations, and prone to effort 
redundancy because of the dearth of community sharing. Therefore, there is a need to integrate the existing sequence, 
structure and functional data into a central repository, one that is enriched with annotations and provides a platform for 
community contributions. Herein, we present STATdb (publicly available at http://statdb.bic.nus.edu.sg/), the first integrated 
resource for STAT sequences comprising 1540 records representing the known STATome, enriched with existing structural 
and functional information from various databases and literature and including manual annotations. STATdb provides 
advanced features for data visualization, analysis and prediction, and community contributions. A key feature is a
meta-predictor to characterise STAT sequences based on a novel classification that integrates STAT domain architecture, lineage
and function. A curation policy workflow has been devised for regulated and structured community contributions, with an 
update policy for the seamless integration of new data and annotations.

Introduction

Signal transducers and activators of transcription (STAT) 
proteins are among the most important signalling molecules in
metazoans [1,2,3], playing dual roles as cytoplasmic signalling 
proteins and nuclear transcription factors in the cell. STATs are 
key components of the Janus Kinase (JAK)/STAT signalling 
pathway [4], an evolutionarily conserved cascade that facilitates a 
wide range of inter- and intra-cellular signalling roles vital for 
cellular differentiation, growth and survival [5,6,7]. STATs are
activated via phosphorylation by kinases, such as JAKs and Src
kinases, and by growth factor receptors, among other activating
proteins, in response to extracellular signalling proteins [8,9].
Upon activation, STAT proteins translocate to the nucleus to
regulate a diverse set of target genes [1]; however, several
deviations from this canonical pathway have been described to date
[10]. Numerous studies have shown that dysregulation of the 
JAK/STAT pathway is associated with chronic inflammation, 
neurodegenerative diseases and cancer, among other disease states 
[11]. 

The STAT protein family in mammals comprises seven
members (STAT1-4, STAT5A, STAT5B and STAT6) with
diverse functions [1,11,12]. Knockout of either STAT1 or STAT2
results in an impaired response to interferons [1]. Furthermore,
the absence of STAT1 results in impaired growth control [13],
whereas STAT2 knockout mice show numerous defects in their
immune response [14]. Early embryonic lethality has been
associated with STAT3 knockout mice [1,13], and additional 
complications, such as multiple defects in adult tissues and an 
impaired response to pathogens, are also linked to the absence of 
STAT3. STAT4 deletion affects T helper 1 (TH1) cell function,
whereas STAT6 deletion impairs TH2 differentiation
[1,13]. Both STAT5A and STAT5B are important for breast
development/lactation: STAT5A is required for prolactin
responsiveness, whereas STAT5B is required for growth hormone
responsiveness [1,13]. STAT5 refers to the gene that duplicated
to give rise to STAT5A and STAT5B in species ancestral to 
mammals [15]. Both STAT5.1 and STAT5.2 are STAT5 
homologs in fishes [15]. 

The STAT family of proteins has thus been studied intensively
[1,11], which has led to the accumulation of sequence and 
structure data scattered across various public databases. For 
example, the primary NCBI sequence databases (GenBank and 
GenPept) are comprehensive but lack extensive functional 
annotations, such as status of experimental validation, STAT 
domains, interacting proteins, and gene and structural information,
which are found in other databases, such as UniProt, RefSeq,
PDB, Gene and CDD, and within the literature. Public databases,
however, are prone to errors [16], and consequently an extensive
analysis is required to ascertain the reliability of data in the public
domain by cross-checking with other databases and with what is
cited in the literature. This difficult task, along with the substantial
lack of sharing amongst the scientific community, has thus led to
redundant efforts in the laboratory. Therefore, there is an urgent 
need to assemble, organize, remove duplicates and integrate 
existing sequence, structure and functional data into a central 
repository that is enriched with annotations and provides a 
platform for community contribution to allow for systematic, 
integrated analyses of STATs. 

Herein, we present STATdb, a specialised repository of STAT 
sequences, representing the known STATome, integrating existing 
sequence, structure and functional information from various 
databases, and the literature, and including manual annotations. 
This, to our knowledge, is the first reported specialised Web 
resource for STAT sequences. STATdb, besides the basic 
functionalities such as database query using keyword search and 
data download, provides advanced features for data visualization, 
analysis and prediction, and community contribution. Users can 
dynamically browse the STATome — the complete dataset of 
reported STAT sequence records in STATdb — and interactively 
view available 3D structures. STATdb is integrated with sequence 
analysis tools, such as the Basic Local Alignment Search Tool 
(BLAST) for sequence similarity searches and ClustalW for
multiple sequence alignments on the fly. A key feature of the 
database is STATdbPredict, which is used to characterize STAT 
sequences based on a novel classification scheme that incorporates 
domain architecture, lineage and function. Sequence records are 
manually annotated with STATdb classification notation,
experimental status validation, and individual domain sequences,
among others. A submission/curation policy workflow has been
devised for regulated and structured contribution of new records 
and for enrichment/correction of functional annotations of 
existing records by the STAT research community (curator) 
through an easy-to-use interface. Community contribution, based 
on existing data and literature, is important in biological
data-warehousing [17] and the approach has been highly successful, as
exemplified by numerous Wiki-based projects: PDBwiki [18],
WikiProteins [19], Gene Wiki [20], RNA Wikiproject [21],
EcoliWiki [22] and WikiPathways [23]. Additionally, an update
policy has been devised for the regular integration of new records
and annotations from public databases and/or the community.

Materials and Methods 

Sequence Data Collection 

Protein and nucleotide sequences of STAT were first collected 
through keyword searches using the National Center for 
Biotechnology Information (NCBI) Reference Sequence (RefSeq) 
[24,25] database, followed by sequence similarity searches against 
all reported sequences in the NCBI non-redundant (NR) database 
[25]. Keyword hits were manually checked and verified as STAT 
according to the literature. Selected verified sequences were used
as queries for a Position-Specific Iterated (PSI)-BLAST search [26] in
order to perform a comprehensive survey of STAT sequences.
Significant BLAST hits were selected and sequence duplicates were
removed using CD-HIT [27]. The remaining non-redundant 
sequences were used to populate STATdb. 
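
The duplicate-removal step can be illustrated with a minimal sketch. It is not CD-HIT, which clusters sequences by a similarity threshold; it only drops exact duplicate sequences from FASTA input, and the function names and example records are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record seen for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Hypothetical input: seq_b repeats seq_a under a different header.
example = """>seq_a hypothetical record
MSQWYELQQL
DSKFLEQVHQ
>seq_b same sequence, different header
msqwyelqqldskfleqvhq
>seq_c
MAQWNQLQQL
""".splitlines()

unique = drop_exact_duplicates(parse_fasta(example))
print([header for header, _ in unique])
```

A similarity-based tool such as CD-HIT would additionally collapse near-identical sequences, which this sketch does not attempt.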

Database Record Annotation 

Existing STAT record annotations in various public databases 
were studied to identify relevant fields for STATdb. The list of
fields defined for STATdb records is provided in Table 1. Fields
that provide information selected from the source record (NCBI
Entrez Protein Database) are marked as "Source", such as gene
name, protein name, type of STAT, database cross-references,
literature, species, location of the gene on the chromosome, length
of the protein sequence, and amino acid sequence for the
full-length protein, as well as the list of individual STAT domains.
"Assigned" fields are those not found in the source record, but
included to provide information obtained from database
cross-references and/or analysis of the sequence data, existing
annotations or the literature.

STATdb Classification 

STATdb-enriched annotations enabled the construction of a 
novel classification scheme for the characterization of STAT 
sequences and for the prediction of novel family members. This 
classification is based on a three-tier system: "Domain Architecture
- Lineage - Function".

"Domain Architecture", or "DA", is used to describe the
observed order/arrangement of STAT domains within the
protein. STAT proteins comprise five major domains: protein
interaction domain (STAT_int), all-alpha domain (STAT_alpha),
DNA-binding domain (STAT_bind), SH2 domain (SH2) and the
transactivation domain (TAZ2). The five unique domain
architectures observed to date for STATdb sequences are referred
to as DA I to DA V, with DA U representing uncommon
combinations and artificial sequences.

"Lineage" is defined as a sub-classification of "Domain 
Architecture", and is based on the taxonomy of the species from 
which the STAT sequence was isolated. All STATdb sequences of 
each domain architecture were analysed for their species lineage 
by interrogating the NCBI Taxonomy Database and the 
sequences were then grouped according to the furthest common 
differentiation level from the root (i.e., cellular organism). As 
STATs are a family of paralogous loci (e.g., in vertebrates), the
classification does not aim to reconcile the species and gene trees in
instances where this is not possible.
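
One possible reading of "furthest common differentiation level from the root" is the deepest taxonomic node shared by a group of lineages. The sketch below illustrates that reading only; the function name is hypothetical and the lineages are abbreviated paths written by hand rather than retrieved from the NCBI Taxonomy Database.

```python
from typing import List, Sequence


def deepest_common_level(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the root-to-node path shared by all given lineages."""
    if not lineages:
        return []
    common: List[str] = []
    # zip stops at the shortest lineage, so no index checks are needed.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            common.append(ranks[0])
        else:
            break
    return common


# Abbreviated, hand-written lineages (assumption: simplified for illustration).
human = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa",
         "Eumetazoa", "Bilateria", "Deuterostomia", "Chordata"]
fruit_fly = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa",
             "Eumetazoa", "Bilateria", "Protostomia", "Ecdysozoa"]

shared = deepest_common_level([human, fruit_fly])
print(shared[-1])
```

Under this reading, the two example species share a path down to Bilateria and diverge at the next level, which is consistent with Deuterostomia and Protostomia being treated as separate lineages.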

"Function" is defined as a sub-classification of "Lineage", and is 
based on the role of the STAT family members. Although there 
are seven mammalian STATs (STAT1-4, STAT5A, STAT5B and
STAT6), numerous other STATs are commonly found in fishes or 
invertebrates, and the functions of these other STATs are also 
incorporated in this sub-classification tier. 

STATdbPredict 

STATdbPredict is a meta-prediction system designed to 
characterize protein sequences based on STATdb classification. 
Users can submit one or more sequences in FASTA format to 
obtain a prediction of DA, lineage and/or function. The
prediction process involves querying for the presence of statistically
reliable STAT domains using Reverse Position-Specific (RPS)-BLAST
[26] and classifying them based on "DA"; this is followed
by identifying the highest scoring pair (HSP) for the prediction of 
"Lineage" (tier two) and "Function" (tier three) using BLASTp 
[26] (see "STATdb Home > Help > Tools: Predict" for the 
prediction algorithm of STATdbPredict). STAT domain Position 
Specific Scoring Matrices (PSSMs) were downloaded from the 
NCBI conserved domain database (CDD) and used to create a 
local in-house RPS-BLAST-searchable database. This in-house 
searchable database is of much smaller size than the original 
CDD; thus, the values of the search output parameters (E-value, 
percentage identity, alignment length and bit score) will not be the 
same between the in-house database and original CDD. Since the
E-value scales with the size of the database searched [26], its
values for the in-house database are not comparable with those
obtained against the original CDD and, thus, are deemed not
appropriate as a parameter for the selection of significant domain
hits. Therefore, the experimentally verified
STAT records were used to determine the acceptable value range
for the remaining vital parameters (percentage identity,
alignment length and bit score) for each domain. The minimum
range values of the three parameters (percentage identity,
alignment length, and bit score) are used as a cut-off for statistical
reliability of a domain hit (see "STATdb Home > Help > Tools:
Predict" for the range values). The HSP is used to ascribe
"Lineage" and "Function", and is defined as the best match to the
query, with a percentage sequence identity of ≥90 and a length
difference of ≤10; predictions based on HSPs that do not meet
these criteria are indicated as hits of low confidence.

Table 1. List of all fields defined for STATdb records.

Field Name | Description | Source*/Assigned
STATdb Id | STATdb Unique Identifier/Accession Number | Assigned (by STATdb authors)
gName | Gene Name | Source & Assigned (via NCBI Entrez Gene database)
pName | Protein Name | Source
STAT type | STAT family sub-group based on function | Source & Assigned (literature†)
STATdb Classification | Classification based on three-tier system: Domain Architecture - Lineage - Function | Assigned (by STATdb authors)
DBXRef | Database Cross References | Source & Assigned (pathway information obtained via KEGG database and other cross references are from source)
Literature | Literature (PubMed Reference Id) | Source
Species (Source Organism) | Species containing STAT | Source
Expt. Status | Experimental Status: E - Experimentally Verified; P - Predicted/Hypothetical; U - Unknown | Assigned (by STATdb authors)
Expt. Status Evidence | Experimental Status Evidence | Assigned (literature†)
ChromLoc | Chromosome Location | Source
IntPartners | Interacting Proteins | Assigned (via NCBI Entrez Gene database)
SeqLen | Sequence Length (Protein) | Source
Completeness | Completeness of the protein sequence: Complete/Incomplete | Assigned (by STATdb authors)
STAT Dom | STAT domains | Source
DomArchitecture | Domain Architecture | Assigned (via SMART database)
STAT DomSeq | Nucleotide & Protein Sequence of STAT domains | Assigned (derived from source)
STAT_int | Nucleotide & Protein Sequence for protein interaction domain | Assigned (derived from source)
STAT_alpha | Nucleotide & Protein Sequence for all alpha domain | Assigned (derived from source)
STAT_bind | Nucleotide & Protein Sequence for DNA binding domain | Assigned (derived from source)
STAT_sh2 | Nucleotide & Protein Sequence for SH2 domain | Assigned (derived from source)
STAT_taz2 | Nucleotide & Protein Sequence for TAZ2 domain | Assigned (derived from source)
BindingMotif | DNA Binding Motif | Assigned (via JASPAR database)
NucSeq | Nucleotide Sequence | Assigned (via NCBI Entrez Nucleotide database)
ProtSeq | Protein Sequence | Source
Comment | STATdb Curation Comments | Assignable

Fields that provide information selected from the source record (NCBI Entrez Protein database) are marked as "Source". "Assigned" fields are those not found in the
source record, but were included to provide information obtained from analysis of the sequence data, existing annotations or the literature.
*NCBI Entrez Protein database.
†The respective literature is indicated in the relevant records.
doi:10.1371/journal.pone.0104597.t001

The accuracy of the prediction system was tested using a test 
dataset comprising new STAT sequences (as at June 2013) not 
found in STATdb (as at April 2013, STATdb comprised 1,424 
records). These new sequences were obtained using the PSI- 
BLAST search against the NCBI NR database. The search 
resulted in 116 new STAT sequences, of which 20 were assigned
"DA U". The remaining 96 classifiable sequences were
non-redundant and used as positive samples for the test dataset, with
the top 96 non-redundant, non-STAT hits from the PSI-BLAST
search used as negative samples. After this analysis was complete, the 116
new STAT sequences were added to STATdb.
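
As a worked check of how accuracy and sensitivity follow from this balanced test dataset, the sketch below uses the one error count stated in the Results: 17 sequences misidentified by standalone BLAST out of 96 positive and 96 negative samples. The split of those errors is an assumption: all 17 are treated as false negatives, chosen because it reproduces the reported sensitivity of about 82%. The full counts are in Table 2, which is not reproduced here.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true STAT sequences that were recovered."""
    return tp / (tp + fn)


positives, negatives = 96, 96   # test dataset sizes stated in the text
errors = 17                     # misidentified by standalone BLAST

# Assumption: every error is a missed STAT sequence (false negative).
fn, fp = errors, 0
tp, tn = positives - fn, negatives - fp

blast_accuracy = round(100 * accuracy(tp, tn, fp, fn), 1)
blast_sensitivity = round(100 * sensitivity(tp, fn), 1)
print(blast_accuracy, blast_sensitivity)  # 91.1 82.3
```

With these inputs the sketch gives about 91% accuracy and 82% sensitivity, in line with the figures reported for standalone BLAST.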

STATdb Construction 

STATdb was created using MySQL (www.mysql.com) and the
user interface was developed using PHP, HTML and
jQuery. MySQL is used for data storage, processing and retrieval
of specific information. PHP (www.php.net) pages are used to
process the forms and browse through the different sections of the
database. HTML was utilised for the website design, with dynamic
record browsing according to different groupings facilitated by
jQuery (www.jquery.com), which is used to manage all the
JavaScript and AJAX. Analysis tools supported by BioSLAX
(www.bioslax.com), such as BLAST similarity search and ClustalW [28]
for multiple sequence alignments, are included in the database.

Results 

Features of STATdb 

Each record in STATdb is given a unique Id in the form of 
"STAT_XXXXX", where "XXXXX" represents five numerical 
digits. A sample record is provided in Figure 1. The records 
comprise standard data fields from the source databases (NCBI 
Entrez Protein database) and "Assigned" fields, which are defined 
by the authors for enriched manual annotations: 

• STATdb Id - provides an Id for each individual STATdb 
record. 

• Gene name - provides the gene name obtained from the NCBI 
Entrez Gene database. 

• STAT type - STAT family sub-group.

• STATdb classification - provides a notation that earmarks the 
characteristic features of the sequence in terms of "DA", 
"Lineage" and "Function" (see "Classification" section below 
for details). 

• DBXRef - provides database cross-references that are mostly
obtained from the source record; however, pathway
information is obtained from the KEGG database.

• Experimental status validation - provides information from the 
literature and/or cross-referenced databases on the reliability 
of the STAT sequence, as either experimentally verified (E), or 
hypothetical/predicted (P) or unknown (U). 

• IntPartners - lists the interacting protein partners of STAT, 
which were obtained from NCBI Entrez Gene database. 

• Completeness of the protein sequence - the sequence is 
considered "complete" if all of the domains for the 
corresponding architecture are present. 

• STAT Domain architecture - describes the order of the STAT 
domains in the sequence (via SMART [29,30] database). 

• STAT Domain sequences - lists the amino acid and the
corresponding nucleotide sequences (obtained by use of
TBLASTN) of the individual STAT domains.

• Binding Motif - provides the STAT binding motif and the 
predicted target gene information obtained from JASPAR 
database. 

• NucSeq - provides the nucleotide sequence obtained from the 
NCBI Entrez Nucleotide database for the corresponding 
protein. 

• STATdb curation comments - this provides a platform for 
annotations and/or corrections by the STATdb community. 
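
The record identifier format described above ("STAT_" followed by five numerical digits) can be checked with a short sketch. The function name is hypothetical and is not part of STATdb.

```python
import re

# "STAT_" followed by exactly five ASCII digits, e.g. STAT_00001.
STATDB_ID = re.compile(r"STAT_[0-9]{5}")


def is_statdb_id(text: str) -> bool:
    """Return True if text has the form STAT_XXXXX with five digits."""
    return STATDB_ID.fullmatch(text) is not None


print(is_statdb_id("STAT_00001"))  # True
print(is_statdb_id("STAT_1046"))   # False: only four digits
```

Using `fullmatch` rather than `match` rejects identifiers with trailing characters, such as a sixth digit.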

The key features of STATdb can be divided into basic and 
advanced, as described below: 
A. Basic: 

i. Keyword and Sequence Search 

Keyword queries of the database include STATdb_Id, gene 
name, protein name, STAT type, species, STAT domain, 
interacting proteins or other database cross-references. A sequence 
search is performed using BLAST against databases of (i) 
experimentally verified sequences, (ii) predicted, (iii) all STAT 
sequences (protein and nucleotides) and (iv) interacting partners 
(JAK, EGFR, and Src Kinase). 

ii. Downloads 



STATdb sequences categorized as "all sequences", "experimentally
verified", "predicted" and sequences of interacting
partners are available for download in FASTA format from
the download page.

B. Advanced: 

i. Browser 

The Browser allows for dynamic browsing of the STATome 
according to all records, types of STAT, DNA or protein 
sequences, interacting proteins, status of experimental validation, 
and STAT DA (Figure 2A). Records can be selected to retrieve the 
full data or only the sequences in FASTA format, or they can be 
submitted for multiple sequence alignment on the fly using 
ClustalW. 

ii. View 3D 

The Jmol viewer allows for the manipulation of available 3D 
structures of STAT obtained from PDB. Currently, there are only
11 reported solved 3D structures for human (2), mouse (8) and the
social amoeba Dictyostelium discoideum (1). Users can analyse the 
structures using the different options provided and also download 
primary sequences (FASTA format) and 3D structure coordinates 
(PDB format). 

iii. Contribute 

"Contribute" offers a platform for the STATdb community to 
curate annotations or submit new STAT sequences (Figure 2B). 
Submissions of new STAT sequences are checked and
verified according to the "Submission Policy" (see
http://statdb.bic.nus.edu.sg/downloads/submission_policy.pdf). This should result in a
database rich with annotations by expert curators in the field.

iv. Classification 

STATs are complex proteins, but were originally classified
based simply on function and named according to their order of
discovery (STAT types) [1,12,13]. The mammalian STAT family
comprises seven different known members (STAT1-4, STAT5A,
STAT5B and STAT6), each of which corresponds to a determined
function (see Table W4 at "STATdb Home > Classification").
Other types are commonly found in fishes or invertebrates. The
"STAT (s)" annotation is used to refer to the family- or
species-specific STATs, and the "(s)" represents the literature name of the
STAT in the particular species. This includes STAT (dstA to D) of
Dictyostelium discoideum and Polysphondylium pallidum PN500,
STAT (D-STAT) of Drosophila melanogaster, and STAT (STA-1)
and STAT (STATB) of C. elegans. Unknown, predicted or
hypothetical STATs are denoted as STAT(u).

Although sequences of a STAT type are described to share the 
same function, our analysis shows that they possess differences in 
their domain architecture and, in some cases, appear to be
lineage-specific [10,31,32,33,34]. As such, there might be subtle but
distinct differences in the mode of function between family
members of a STAT type, which merits further investigation. The 
rationale behind our classification system was to further stratify the 
original classification in a way that would allow for the quick 
delineation of possible structure, function and lineage of novel 
STATs. 

The analysis of STAT type by structure revealed five distinct 
domain architectures (see Table W2 at "STATdb Home > 
Classification"). Domain architecture I (DA I) contains all of the 
five domains in the order of STAT_int, STAT_alpha, STAT_bind,
SH2 and TAZ2 from N- to C-terminus (see Materials and
Methods for their descriptions). DA II lacks the TAZ2 domain, 
whereas DA III lacks both the TAZ2 and the STAT_int. DA IV 
contains only the STAT_bind and SH2 domains, and DA V 
comprises the coiled-coil domain (Dict_STAT_coil) and the SH2 
domain. All other sequences that cannot be classified in this way,
but contain or show similarity to at least one of the five major
domains, are labeled as DA U. All artificial sequences, even if
they share the observed orders, are still classified as DA U.

Figure 1. A sample STATdb record.
doi:10.1371/journal.pone.0104597.g001

Figure 2. Snapshots of selected STATdb key features. A) STATome Browser - allows for the dynamic browsing of the STATome, a complete set
of reported STAT records in STATdb. B) Contribute - provides a platform for the STATdb community to curate annotations or submit new STAT
sequences. C) Classification - provides a notation that describes the grouping of a sequence based on our three-tier classification system: "Domain
Architecture - Lineage - Function" and D) Predict - characterizes protein sequences using STATdb classification.
doi:10.1371/journal.pone.0104597.g002
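
The domain-architecture rules (DA I to DA V, with DA U as the fallback) can be summarised in a minimal sketch. Only the DA I order is stated explicitly in the text; the orders used here for DA II to DA V are assumptions that follow the same N- to C-terminal arrangement, and the function and constant names are hypothetical.

```python
from typing import Optional, Sequence

# Ordered domain lists, N- to C-terminus. Orders for DA II-V are assumed.
DOMAIN_ARCHITECTURES = {
    "DA I": ("STAT_int", "STAT_alpha", "STAT_bind", "SH2", "TAZ2"),
    "DA II": ("STAT_int", "STAT_alpha", "STAT_bind", "SH2"),
    "DA III": ("STAT_alpha", "STAT_bind", "SH2"),
    "DA IV": ("STAT_bind", "SH2"),
    "DA V": ("Dict_STAT_coil", "SH2"),
}
KNOWN_DOMAINS = {d for arch in DOMAIN_ARCHITECTURES.values() for d in arch}


def assign_domain_architecture(domains: Sequence[str],
                               artificial: bool = False) -> Optional[str]:
    """Map an ordered list of detected domains to a DA label.

    Returns None when no known STAT domain is present at all.
    """
    observed = tuple(domains)
    if not any(domain in KNOWN_DOMAINS for domain in observed):
        return None
    if artificial:
        # Artificial sequences are DA U even if the order matches DA I-V.
        return "DA U"
    for label, expected in DOMAIN_ARCHITECTURES.items():
        if observed == expected:
            return label
    return "DA U"


print(assign_domain_architecture(["STAT_bind", "SH2"]))  # DA IV
print(assign_domain_architecture(["SH2"]))               # DA U
```

Exact tuple comparison keeps the sketch strict about domain order, so a sequence carrying the right domains in an unobserved arrangement falls through to DA U.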

The five domain architectures can be further differentiated into 
14 unique lineages, notated as "A" to "M", with "Z" for artificial 
sequences (see Table W3 at "STATdb Home > Classification"). 
Two lineages were observed for each of Bilateria (Deuterostomia and
Protostomia), Cnidaria (Anthozoa and Hydrozoa), Choanoflagellida
(Monosiga and Salpingoeca), and Dictyosteliida (Dictyostelium and
Polysphondylium), whereas one lineage was observed for each of
Placozoa (Trichoplax), Porifera (Demospongiae), Ichthyosporea
(Capsaspora), Acanthamoeba and Tracheophyta.



The stratification of STAT types into "DA" and "Lineage" 
resulted in a three-tier classification system, with notations, such as 
"DA I : A : STAT1", which describes "DA" (tier one), "Lineage" 
(tier two), and "Function" (tier three), respectively (Figure 2C). 
Currently, the website comprises 96 notations that involve three
tiers (see Table W1 at "STATdb Home > Classification").
Searches can thus be performed according to these collective 
notations or to each individual tier; this information is provided 
under the "Classification" section ("STATdb Home >
Classification") or can be found using the "Search" page ("STATdb
Home > Search"). The classification will be updated regularly to
provide a true representation of the STATome as the database
grows.

Figure 3. STATdbPredict output report page for STAT_00001. The alignment is cropped to save space.
doi:10.1371/journal.pone.0104597.g003
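
A three-tier notation such as "DA I : A : STAT1" lends itself to a small structured representation, which also makes searching by individual tier straightforward. The sketch below is illustrative only; the class and field names are hypothetical, and the second and third example notations are invented for the demonstration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StatdbNotation:
    domain_architecture: str  # tier one, e.g. "DA I"
    lineage: str              # tier two, e.g. "A"
    function: str             # tier three, e.g. "STAT1"

    @classmethod
    def parse(cls, text: str) -> "StatdbNotation":
        """Split 'DA : Lineage : Function' into its three tiers."""
        parts = [part.strip() for part in text.split(":")]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected 'DA : Lineage : Function', got {text!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.domain_architecture} : {self.lineage} : {self.function}"


def filter_by_lineage(notations: List[StatdbNotation], lineage: str) -> List[StatdbNotation]:
    """Select notations whose tier-two value matches the given lineage."""
    return [n for n in notations if n.lineage == lineage]


# The first notation comes from the text; the other two are made up.
records = [StatdbNotation.parse(s) for s in
           ("DA I : A : STAT1", "DA II : A : STAT3", "DA IV : B : STAT(u)")]
print([str(n) for n in filter_by_lineage(records, "A")])
```

Keeping the three tiers as separate fields means a query on one tier does not depend on string matching against the combined notation.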

v. Predict 

STATdbPredict characterizes protein sequences through the 
STATdb classification system (Figure 2D). This prediction system 
reports the STATdb classification notation of the query sequenc- 
ed) along with any additional information, such as individual 
domain hits, the HSP, and the frequency of the different notations 
(Figure 3). This provides information on the potential structure, 
function and lineage of novel STATs, which can help in planning 
experiments for validation. STATdbPredict is essentially a 
combination of two BLAST programs: RPS-BLAST (against an 
in-house database of PSSM matrices downloaded from CDD) and 
standard BLASTp (against STATdb version without the test 
dataset) with optimized parameters that are applied in the context 
of the 3-tier classification. Outputs of STATdbPredict are 
annotated according to the classification, whereas a standalone 
BLAST search against STATdb sequences also annotated 
according to the 3-tier annotation would provide a similar result 
but of lower overall accuracy (~91% versus ~94%) and sensitivity 
(-82% versus -89%) than STATdbPredict (Table 2). This is 
because RPS-BLAST, through the use of PSSM matrices, captures 
the diversity of the domains, which cannot be represented by a 
single HSP of a BLAST search. Even though the percentage 
differences in accuracy between STATdbPredict and standalone BLAST 
seem minor, the absolute number of records affected is substantial 
and will grow with the size of the dataset; for example, 17 of the 192 
sequences (~9%) in a test dataset of 96 positive and 96 negative 
samples were incorrectly identified by standalone BLAST. Nonetheless, both methods have 
a high overall accuracy because of the granular stratification of 
STAT sequences into the 3-tier classification system. The 
prediction system will be updated regularly for improved reliability 
as the size of the database grows. As the number of records and 
corresponding annotations increases, STATdb can serve as a platform 
for the future development of more sophisticated meta-predictors 
for scanning genomes and proteomes across the tree of life for novel 
STATs in practical applications. 
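
The two-search design described above can be illustrated with a minimal Python sketch. It is not the STATdbPredict implementation: the database names, the E-value cut-off and the helper functions are hypothetical placeholders, and the step that maps hits to the 3-tier classification is not shown. The sketch only builds the command lines for the two BLAST+ programs (rpsblast for domain hits against a PSSM database, blastp for the best-matching STATdb record) and parses their tabular output (-outfmt 6).

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def build_commands(query_fasta, pssm_db, stat_db, evalue=1e-3):
    """Return argument lists for the two searches.

    pssm_db, stat_db and the E-value cut-off are hypothetical placeholders,
    not the optimized parameters used by STATdbPredict.
    """
    common = ["-query", query_fasta, "-evalue", str(evalue), "-outfmt", "6"]
    return {
        "domains": ["rpsblast", "-db", pssm_db] + common,   # domain (PSSM) hits
        "homologs": ["blastp", "-db", stat_db] + common,    # sequence-level hits
    }


def parse_outfmt6(text):
    """Parse tab-separated BLAST output into a list of dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(OUTFMT6_FIELDS, line.split("\t")))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        hits.append(row)
    return hits


def best_hit(hits):
    """Return the hit with the highest bit score, or None if there are no hits."""
    return max(hits, key=lambda h: h["bitscore"], default=None)
```

With BLAST+ installed, each argument list could be passed to subprocess.run(cmd, capture_output=True, text=True, check=True) and the captured standard output given to parse_outfmt6. How the domain hits and the best BLASTp hit are then combined into a classification notation is specific to STATdbPredict and is deliberately left out here.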

Application of STATdbPredict: Defining the STATome 

The STATome represents all reported STAT sequences in 
nature. The sequences used to populate STATdb were obtained 
via two approaches: (i) a standard search of NCBI NR (see 
Materials and Methods for "Sequence Data Collection") and (ii) 
a STATdbPredict scan of the UniProt UniRef100 [35] and NCBI NR 
datasets. At the time of collection, the UniRef100 dataset 
contained 20,002,214 sequences, whereas the NR dataset 
contained 23,075,327 sequences. The standard search returned 
1,126 STAT sequences, whereas STATdbPredict identified an 
additional 65 unique sequences from NR and 233 from 
UniRef100. In addition, the 116 sequences identified during the 
accuracy analysis of STATdbPredict, which were obtained more 
recently using a standard NCBI NR search, were eventually 
included in STATdb, resulting in a total of 1,540 distinct sequence 
records. 
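
As a quick consistency check, the source counts reported above can be tallied as follows. The counts are taken from the text; the dictionary labels and variable names are illustrative only.

```python
# Sequence counts as reported in the text; the labels are illustrative.
statome_sources = {
    "standard NCBI NR search": 1126,
    "STATdbPredict scan of NCBI NR (additional unique)": 65,
    "STATdbPredict scan of UniRef100 (additional unique)": 233,
    "later NCBI NR search (accuracy analysis)": 116,
}

total = sum(statome_sources.values())
added_by_predict = 65 + 233

print(total)                               # 1540
print(f"{added_by_predict / total:.1%}")   # 19.4%
```

On these figures, roughly one in five records in the collection was recovered only through the STATdbPredict scans.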

STATdb is currently the only specialised repository of the 
STATome. Of the 1,540 records (as at June 2013), 186 are 
experimentally ("E") verified STAT sequences, whereas 1,354 are 
predicted ("P") (see submission/curation policy for grouping 
procedure). A total of 93 records have annotations of the 
interacting partners, which broadly fall under four groups: 
inhibitors, such as protein inhibitor of activated STAT (PIAS), 
and suppressors of cytokine signaling (SOCS); activators, such as 
JAK, Src kinase and EGFR; cytokines, such as interferons and 
interleukins, which comprise the majority; and unclassified, such 
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as JUN, BCL3, Gfap, EP300, among others. STAT is currently 
reported to be present in 235 species from diverse lineages, 
including bilateria, cnidaria, choanoflagellida, dictyosteliida, 
placozoa, porifera, ichthyosporea, acanthamoeba and tracheophyta. 
The STAT types STAT1, STAT2, STAT4, STAT5A, 
STAT5B and STAT6 are each represented by more than 100 records, 
whereas STAT3 and STAT(u) comprise over 200 and 300 
records, respectively (navigate to "STATdb Home > Help > 
Statistics" for the list of species and STAT types). 

Maintenance, stability and growth of STATdb 

We have devised an update policy (see 
http://statdb.bic.nus.edu.sg/downloads/update_policy.pdf) for the regular growth of 
the database. The stability of the database will be monitored 
regularly, and feedback from users will be key in addressing any 
bugs or issues within the system. Additionally, regression testing 
will be performed before major updates to ensure full functionality 
and stability. Plans for longevity of the database beyond the 
current team include a proposal for the long-term maintenance of 
the database by a group of volunteers selected from the list of top 
contributors. These users will be given the authority to make 
changes to the database in accordance with the standard system 
administrator acceptable use policy, and will also be responsible 
for maintaining the various policies of the database, such as the new 
sequence submission and update policies. Other plans include 
depositing the latest copy of the database in the Asia-Pacific Bioinformatics Network's 
(APBioNet's) Web-accessible cloud re-instantiation system 
(http://biodb100.apbionet.org; [36]) for archival and future on-demand 
re-instantiation by users if the original database site is not 
accessible. This is in line with the Minimum Information about a 
Bioinformatics Investigation (MIABi) standards [37], harmonised 
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Discussion 

STATdb is a unique Web resource that provides a comprehensive 
collection of STAT protein and nucleotide sequences, 
enriched with functional and structural annotations for data 
mining and analyses. The significant attributes of STATdb 
include: (a) integration of STAT data from different databases, 
creating the only unified STATome reported to date; (b) a novel 
classification system (comprising characteristic features of STAT 
protein sequences), which is used as the basis for STATdbPredict, a 
high-accuracy (>90%) meta-predictor; (c) tools to analyse the 
functional and structural properties of STATs (BLAST, alignment, 
STATdbPredict); and (d) a platform for community contribution, 
which is guided by submission/curation and update policies. We 
envisage that this database will serve as a template for the 
development of a knowledgebase for signaling proteins. 
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